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Abstract 

A method of 'network filtering' has been proposed recently to detect the effects of certain external perturbations on the 

, , interacting members in a network. However, with large networks, the goal of detection seems a priori difficult to achieve, 

H 

I ^ ' especially since the number of observations available often is much smaller than the number of variables describing the effects 

oo : of the underlying network. Under the assumption that the network possesses a certain sparsity property, we provide a formal 

characterization of the accuracy with which the external effects can be detected, using a network filtering system that combines 
Lasso regression in a sparse simultaneous equation model with simple residual analysis. We explore the implications of the 
technical conditions underlying our characterization, in the context of various network topologies, and we illustrate our method 
using simulated data. 
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I. Introduction 

A canonical problem in statistical signal and image processing is the detection of localized targets against 
complex backgrounds, which often is likened to the proverbial task of 'finding a needle in a haystack'. 
In this paper, we consider the task of detecting such targets when the 'background' is neither a one- 
dimensional signal nor a two-dimensional image, but rather consists of the 'typical' behavior of interacting 
units in a network system. More specifically, we assume network-indexed data, where measurements are 
made on each of the units in the system and the interaction among these units manifests itself through the 
correlations among these measurements. Then, given the possible presence of an external effect applied 
to a unit(s) of this system, we take as our goal the task of identifying the location and magnitude of this 
effect. It is expected that evidence of this effect be diffused throughout the system, to an extent determined 
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Fig. 1, Schematic illustration of the network filtering process proposed in this paper, shown in two stages. In the first stage, the aim is to recover information 
on the correlation (i.e., B) among the five network units, given training data Y. In the second stage, that information is used to filter new data Y, produced 
in the presence of an effect external to the system (i.e., <f>), so as to detect the target of that effect. 



by the underlying network of interactions among system units, like the blurring of a point source in an 
image. As a result, an appropriate filtering of the observed measurements is necessary. These ideas are 
illustrated schematically in Figure [Q 

While networks have been an important topic of study for some time, in recent years there has been 
an enormous surge in interest in the topic, across various diverse areas of science. Examples include 
computer traffic networks (e.g., [|T0|), biological networks (e.g., [Q]|), social networks (e.g., E510 . and 
sensor networks (e.g., Il23l0 . Our network filtering problem was formulated by, and is largely motivated 
by the work of, Cosgrove et al. [81, who used it to tackle the problem of predicting genetic targets of 
biochemical compounds proposed as candidates for drug development. However, the problem is clearly 
general and it is easy to conceive of applications in other domains. 

The authors in [81 model the acquisition of network data, including the potential presence of targets, 
using a system of sparse simultaneous equation models (SSEMs), and propose to search for targets using 
a simple two-step procedure. In the first step, sparse statistical inference techniques are used to remove 
the 'background' of network effects, while in the second step, outlier detection methods are applied to 
the resulting network- filtered data. Empirical work presented in [81, using both simulated data and real 
data from micro-array studies, demonstrates that such network filtering can be highly effective. However, 
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there is no accompanying theoretical work in 

In this paper, we present a formal characterization of the performance of network filtering, exploring 
under what conditions the methodology can be expected to work well. A collection of theoretical results 
are provided, which in turn are supported by an extensive numerical study. Particular attention is paid to 
the question of how network structure influences our ability to detect external effects. The technical aspects 
of our work draw on a combination of tools from the literatures on sparse regression and compressed 
sensing, complex networks, and spectral graph theory. 

The remainder of the paper is organized as follows. The basic SSEM model and two-step network 
filtering methodology are presented formally in Section [III In Section [III] we characterize the accuracy 
with which the network effects can be learned from training data, while in Section [IVj we use these results 
to quantify the extent to which external effects will be evident in test data after filtering out the learned 
network effects. Numerical results, based on simulations under various choices of network structure, are 
presented in Section [V] Finally, we conclude with some additional discussion in Section [VIl Proofs of all 
formal results are gathered in the appendices. 

II. Network Filtering: Model and Methodology 

Consider a system of p units (e.g., genes, people, sensors, etc.). We will assume that we can take 
measurements at each unit, and that these measurements are likely correlated with measurements at 
some (small) fraction of other units, such as might occur through 'interaction' among the units. For 
example, in §8§, where the units are genes, the measurements are gene expression levels from microarray 
experiments. Genes regulating other genes can be expected to have expression profiles correlated across 
experiments. Alternatively, we could envision measuring environmental variables (e.g., temperature) at 
nodes in a sensor network. Sensors located sufficiently close to each other, with respect to the dynamics 
of the environmental process of interest, can be expected to yield correlated readings. 

We will also assume that there are two possible types of measurements: a training set, obtained under 
'standard' conditions, and a test set measured under the influence of additional 'external' effects. The 



training set will be used to learn patterns of interaction among the units (i.e., our 'network'), and with 
that knowledge, we will seek to identify in the test data those units targeted by the external effects. 

We model these two types of measurements using systems of simultaneous equation models (SEMs). 
Formally, suppose that for each of the p units, we have in the training set n replicated measurements, 
which are assumed to be realizations of the elements of a random vector Y = (Yi, . . . , Y p )' . Let Yi be 
the i-th element of Y, and let Yf_ $ denote all elements of Y except Y { . We specify a conditional linear 
relationship among these elements, in the form 

p 

Yi | F[_i] = y H] = ^PijVj + e; , (1) 

where represents the strength of association of the measurement for the i-th unit with that of the j-th 
unit, and e, are error terms, assumed to be independently distributed as N(0,a 2 ). That is, we specify a 
so-called 'conditional Gaussian model' for the Yi, which in turn yields a joint distribution for Y in the 
form 

Y ~ N(0, (/-5)"V), (2) 

with B being a matrix whose entry By is for % ^ j, and zero otherwise. See [0 Ch 6.3.2]. 
The matrix I — B is assumed to be positive definite. In addition, we will assume B (and hence I — B) 
to be sparse, in the sense of having a substantial proportion of its entries equal to zero. A more precise 
characterization of this assumption is given below, in the statement of Theorem 1. 

We can associate a network with this model using the framework of graphical models (e.g., lfT9l ). Let 
each unit in our system correspond to a vertex in a vertex set V = {1, . . . ,p}, and define an edge set E 
such that {i,j} 6 E if and only if B^ ^ 0. Then the model in ©, paired with the graph G = (V, E), is a 
Gaussian graphical model, with concentration (or 'precision') matrix Q — (I — B)a~ 2 and concentration 
graph G. Since we assume B to be sparse, the graph G likewise will be sparse. Gaussian graphical models 
are a common choice for modeling multivariate data with potentially complicated correlation (and hence 
dependency) structure. This structure is encoded in the graph G, and questions regarding the nature of 
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this structure often can be rephrased in terms of questions involving the topology of G. In recent years, 
there has been increased interest in modeling and inference for large, sparse Gaussian graphical models 
of the type we consider here (e.g., ifTTTl . lj2TT0 . 

For the test set, our observations are assumed to be realizations of another random vector, say Y = 
(Y"i, . . . ,Y p )', the elements of which differ from those of Y only through the possible presence of an 

additive perturbation. That is, we model each Y h conditional on the others, as 

p 

Yi | Yj_i] = = ^2 PaVi + <Pi + £i , (3) 

where fa denotes the effect of the external perturbation for the i-th unit, and the error terms Sj are again 
independently distributed as iV(0,<7 2 ). Similar to ©, we have in this scenario 

Y ~ jV ( (J — > (/-£)-V) , (4) 

where (p = (fa, . . . ,fa)'. 

The external effects are assumed unknown to us but sparse. That is, we expect only a relatively small 
proportion of units to be perturbed. Our objective is to estimate the external effects <p an d to detect which 
units i were perturbed i.e., to detect those units i for which fa stands out from zero above noise. But 
we do not observe the external effects directly. Rather, these effects are 'blurred' by the network of 
interactions captured in B, as indicted by the expression for the mean vector in ©. If B were known, 
however, it would be natural to filter the data Y, producing 

^ideal = (/ _ B } Y . (5) 

The random vector fa deal has a multivariate Gaussian distribution, with expectation <p and covariance 
(J — B)a 2 . Hence, element-wise, each fafeai distributed as N(fa,a 2 ), and therefore the detection of 
perturbed units i reduces to detection of a sparse signal against a uniform additive Gaussian noise, which 
is a well-studied problem. Note that under this model, we expect the noise in fa deal to be correlated. 
However, given the assumptions of sparsity on B, these correlations will be relatively localized. 
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Of course, B typically is not known in practice, and so (p ldeal in ([5]) is an unobtainable ideal. Studying 
the same problem, Cosgrove et al. (H proposed a two-stage procedure in which (i) p simultaneous sparse 
regressions are performed to infer B, row-by-row, yielding an estimate B, and (ii) the ideal residuals in 
© are predicted by the values 

i = (I - B)Y , (6) 

after which detection is carried ou|J They dubbed this overall process 'network filtering'. A schematic 
illustration of network filtering is shown in Figure [TJ 

Our central concern in this paper is with characterizing the conditions under which network filtering can 
be expected to work well. Motivated by the original context of Cosgrove et al., involving a network of 
gene interactions and measurements based on micro-array technology, we assume here that (i) p^> n, (ii) 
the matrix B is sparse, and (iii) the vector <p is sparse. In carrying out our study, we adopt a strategy for 
estimating B based on Lasso regression E4l . a now-canonical example of sparse regression. Specifically, 
motivated by ©, we estimate each row B i9 as 

p 

B i9 = arg min \\y { - PijVjWl + , (V) 

where fi > is a regularization parameter. Following this estimation stage, we carry out detection using 
simple rank-based procedures. 

We present our results in two stages, first describing conditions under which B estimates B accurately, 
given the system of sparse simultaneous equation models (SSEMs) defined by CO), and then discussing the 
nature of the resulting vector <\>. In both stages, we explore the implications of the topological structure 
of G on our results. 

III. Accuracy in estimation of B 

At first glance, accurate estimation of B seems impossible, since even if the error terms are small, 
this noise typically will be inflated by naive inversion of our systems of equations (i.e., because p ^> n). 

1 Technically, Cosgrove et al. work under a model that differs slightly from ours, sharing the same conditional distributions, but arrived at through specification 
of a different joint distribution. See (9] Ch 6.3] for discussion comparing such 'simultaneous Gaussian models' with our conditional Gaussian model. 



However, recent work on analogous problems in other models has shown that under certain conditions, 
and using tools of sparse inference, it is indeed possible to obtain good estimates. Results of this nature 
have appeared under the names 'compressed sensing', 'compressive sampling', and similar. See the recent 
review H, and the references therein. The following result is similar in spirit to these others, for the 
particular sparse simultaneous equation models we study here. 

Theorem 1: Assume the training model defined in © and ([2]), and set E = (/ — B)~ l a 2 . Let S be the 
largest number of non-zero entries in any row of B, and suppose that T,,S,p, and n satisfy the conditions 

we) < A+y^ y (8) 



(E) " \l-^Sj 

and 



n 



p{r) < 1 • (9) 

Here \ m ax and \ m in refer to maximum and minimum eigenvalues and p(r) = (l + /(4r)) 2 +2(l + /(5r)) 2 — 
3, where r — S/(p — 1) and f is a function to be defined lattera Finally, assume that p 2 < (Coa 2 (~ )/S, 
for a constant Co = Co(n,p, r) and (~ = n (l — 4(log 2 n/n) 1 / 2 ). Let p = n u , for v > 1. Then it follows 
that, with overwhelming probability, for every row B i9 of B the estimator B i9 in satisfies the relation 

\\B i .-B i .\\l<Ca 2 (;+ , (10) 

where = n (l + 4(log 2 n/n) 1//2 ) and C > is a constant. 

Remark 1: The accuracy of B i9 is seen in (fTOb to depend primarily on the product no 2 and on C. The 
constant C can be bounded by an expression of the form (1 — p(r))~ 2 times a constant depending only 
on the structure of E. The magnitude of C therefore is controlled essentially by the extent to which p(r) 
is less than 1, which in turn is a rough reflection of the sparsity of the network. Hence, in order to have 
good accuracy, a 2 must be small compared to n~ l . In particular, if a 2 = 0(n~ w ), for to > 1, then the 
error in (flOl) behaves roughly like O (n - ^ -1 -*). 

2 Specifically, / is defined in Section III, sub-section B, immediately following equation 1141 . 
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Remark 2: Clearly it cannot be expected that we estimate B with high accuracy in all situations. The 
expressions in © and © dictate sufficient conditions under which, with overwhelming probability 
(meaning with probability decaying exponentially in p), we can expect to do well. Due to the intimate 
connection between the covariance E and the concentration graph G, these conditions effectively place 
restrictions on the structure of the network we seek to filter, with © controlling the relative magnitude 
of the eigenvalues of the matrix I — B, and ©, its sparseness. Note that since S is simply the maximum 
degree of G, condition © relates the maximum extent of the degree distribution of G to the sample size 
n. We explore the nature of these conditions in more detail immediately below. 

Remark 3. In general, of course, choice of the Lasso regularization parameter fj, > in © matters. 
The statement of Theorem 1 includes constraints on the range of acceptable values for this parameter. 
In particular, it suggests that fi should vary like a 2 n/S, which for a 2 = 0(n _aJ ) means we want fi = 
0(S^ 1 n~^~ r) ). The theorem does not, however, provide explicit guidance on how to set this parameter 
in practice. For the empirical work shown later in this paper, we have used cross-validation, which we 
find yields results like those predicted by the theorem over a broad range of scenarios. 

Remark 4. There are results in the literature that address other problems sharing certain aspects of our 
network filtering problem, but none that address all together. For example, the bound in (flOl) is like 
that in work by Candes and Tao and colleagues (e.g., flU, (5]|), although for a single regression, rather 
than a system of simultaneous regressions. In addition, those authors use constrained minimization for 
parameter estimation, rather than Lasso-based optimization. As Zhu ||26*ll has recently pointed out, there 
are small but important differences in these closely related problems. Our proof makes use of Zhu's results. 
Similarly, Greenshtein and Ritov lfl5l present results for models that - in principle at least - include the 
individual univariate regressions in (OQ), although again their results do not encompass a system of such 
regressions. Furthermore, their results are in terms of mean-squared prediction error, rather than in terms 
of the regression coefficients themselves. Finally, Meinshausen and Biihlmann EH have studied the use 
of Lasso in the context of Gaussian graphical models, but for the purpose of recovering the topology of 



G i.e., for variable selection, rather than parameter estimation. The proof of Theorem 1 may be found in 
Appendix A. 

In the remainder of this section, we examine conditions © and © in greater depth. These conditions 
derive from our use of certain concentration inequalities, which - although central to the proof of our 
result - can be expected to be somewhat conservative. Our numerical results, shown later, confirm this 
expectation. Nonetheless, these conditions are useful in that they help provide insight into the way that 
the network topology structure, on the one hand, and the sample size n, on the other, can be expected to 
interact in determining the performance of our network filtering methodology. 

A. The eigenvalue constraint 

Recall that the covariance matrix £ is proportional to (I — B)^ 1 . In order to better understand the 
condition on the covariance matrix in ([8]), consider the special case of 

XT 1 — (I — B) — I + qD~ l/2 AD~ 1 ' 2 , (11) 

where A is the adjacency matrix for a graph G, D = diag[(rfj)] ig y is a diagonal matrix, di is the degree 
(i.e., the number of neighbors) of vertex i, and q > is a constant. Here the covariance £ is defined 
entirely in terms of the topology of the concentration graph G. While later, in Sections 3 and 4, we 
use simulation to explore more complicated covariance structures, where the B^ are assigned randomly 
according to certain distributions, the simplified form in (fTTI) is useful in allowing us to produce analytical 
results. In particular, conditions on E reduce to conditions on our network topology^. For example, the 
following theorem describes a sufficient condition under which © holds for this model. 

Theorem 2: Suppose that the covariance matrix E from Theorem 1 is defined through 4771) . with < 
q < 1. Denote 

3 We note that lilt can be rewritten in the form E _1 = (1 + q)I — q ■ C, where C is the (normalized) Laplacian matrix of the graph G. In other words, 
the precision matrix Q = E -1 in this simple model is just a modified Laplacian matrix. 
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ER graph 



BA graph 



Geometric graph 




Fig. 2. Plots of ER, BA, and geometric random graphs of size p = 100 and average degree d = 4. 



where i ~ j indicates that the vertices i and j are neighbors in G and d max = maxi<j< p di is the maximum 
vertex degree. Then condition (|#]) on the eigenvalues of £ is satisfied if 



Proof of this result may be found in Appendix B. The restriction on q ensures that the matrix £ 1 is 
diagonally dominant, which is needed for our proof, although it likely could be weakened. Note that the 
condition in (fT3l) involves the graph G only through the degree sequence {d±, . . . , d p }. More precisely, 
this condition relates the average harmonic mean of neighbor degrees (i.e., rj{) and the maximum degree 
to the sample size n and the constant q. Accordingly, given a network, it is straightforward to explore the 
implications of this condition numerically. For example, we can explore the range of values q for which 
the condition holds, given n. 

Figure [2] shows examples of three network topologies. The first is an Erdos-Renyi (ER) random 
graph[13], a classical form of random graph in which vertex pairs i,j are assigned edges according 
to independently and identically distributed Bernoulli random variables (i.e., coin flips). The degree 
distribution of an ER network is concentrated around its mean and has tails that decay exponentially 
fast. The second is a random graph generated according to the Barabasi and Albert model fl2], which was 
originally motivated by observed structure in the World Wide Web. The defining characteristic of the BA 
model is that the derived network has a degree distribution of a power-law form, with tails decreasing like 



(1 + q) 2 K l + q 



) 2 Vi < V2 ■ 



(13) 
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Fig. 3. Plots of eigenvalue ratio for ER, BA, and geometry graph under different values of q. 

d~ 3 for large d. Therefore, the BA networks tend to contain many vertices with only a few neighbors, and a 
few vertices with many neighbors. Lastly, we also use a geometric random graph model, such as might be 
appropriate for modeling spatial networks. Folio wing EH, vertices in the graph are uniformly distributed 
throughout the unit square [0, l] 2 , and each vertex pair i, j has an edge with probability <p (w^p 1 ^ 2 ), where 
</>(■) is the standard normal density function and is the Euclidean distance in [0, l] 2 between i and j. 
In all three cases, the random graph was of size p = 100 and had average degree d = 4. 

In Figure [3] we show the eigenvalue ratio in ([8]), under the simplified covariance structure in (flTT) . 
for these ER, BA,and geometric random graphs, as a function of q. The horizontal lines represent the 
theoretical eigenvalue ratio bound given by Theorem 1. The open symbols (including the 'plus' symbol) 
indicate graphs that satisfy the condition in Theorem 2, while the filled symbols indicate graphs that do 
not satisfy the condition. We can see from the plot that the condition in Theorem 2 clearly is conservative, 
since as a function of q it ceases to hold long before the inequality in ([8]) is violated. 

B. The sparsity constraint 

The second condition in Theorem 1, given in ©, can be read as a condition on the sparsity r = S/(p—l) 
of the precision matrix Vt (x I — B, and therefore a condition on the sparsity of our network graph G. 
The analytical form of the function p(-) is 

p(r) = (1 + /(4r)) 2 + 2(1 + /(5r)) 2 - 3 , (14) 

where f(r) = y/p/n \^Jr + y/2H(r)^j , and H(r) = — r log(r) — (1 — r) log(l — r) is the entropy function. 
While it is not feasible to produce a closed-form solution in r to the inequality ©, it is straightforward 
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to explore the space of solutions numerically. 

Note that p(r) actually is a function of the three parameters S, p, and n through the two ratios S/(p — 1) 
and n/p. In practice we expect both ratios to be in the interval (0, 1). Shown in Figure |4] is /?(•), as a 
function of r, for a handful of representative choices of n/p. We see from the plot that the theory suggests, 
through condition ®, that the sparsity r should be bounded by roughly 1 x 10~ 4 . Our numerical results, 
however, shown later, indicate that the theory is quite conservative, in that, for example, for our simulations 
we successfully used networks with sparsity on the order of r = 0.04. Analogous observations have been 
made in [@]|. Also shown in Figure |4] is a 3D plot of p(-), as a function of both r and n/p. In this plot, 
the dark area corresponding to the innermost contour line satisfies the condition that p(r) < 1. Again, the 
value of the information shown here is primarily as an indication of the existence of feasible combinations 
of S, p, and n allowing for the accurate estimation of the rows of B. 



Numerical evaluation of the upper bound rho(r) 




x 10"' 




n/p ratio 



Fig. 4. 2D plot and 3D plot showing the behavior of p n / v {r) for three values of the ratio n/p. 



IV. Accuracy of the network filtering 

With the accuracy of B quantified, we turn our attention to the effectiveness of our filtering of the 

network effects. Specifically, in the following theorem we characterize the behavior of </>, defined in ©, 

as a predictor of < p' ldea \ defined in Q. 

Theorem 3: Suppose Y is a p x 1 vector of test data, obtained according to the model defined in (TJ|) 
and d?]). Let = (J — B)Y be defined as in © and let A = B — B. Then conditional on B, <j) has a 
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multivariate normal distribution, with expectation and variance 



E 




(15) 




(16) 



Furthermore, under the conditions of Theorem 1, element-wise we have 



E 



[{^-^)\b] I <||0|| 2 [(c^+fw^-B)- 1 )] 



(17) 



and 



Var 



[4>i | S] < o 2 [ 1 + Ca 2 C+A maa; ((J - B)" 1 ) ] 



with overwhelming probability, where C > an<i Q" are as in Theorem 1. 



Proof of this theorem may be found in Appendix C. Recall that fa deal in §5$ is distributed as a multivariate 
normal random vector, with expectation qb and variance (I — B)a 2 . Equations (fT5l) and <TT6b show that our 
predictor <j> mimics fa deal well to the extent that our error in estimating B - that is, those terms involving 
A - are small. Theorem 1 quantifies the magnitude of the rows A,. = B im — B i9 of A, from which we 
obtain the term Ca 2 (^ in our bounds on the element-wise predictive bias in (flTT) and variance in (fT8l) . 
Remark 4: In the case that there are no external effects exerted upon our system i.e., = 0, the elements 
fadeal Q f ideal estimate fa deal are just identically distributed iV(0, a 2 ) noise. This case corresponds to 
the intuitive null distribution we might use to formulate our detection problem as a statistical hypothesis 
testing problem. The implication of the theorem is that, in using <p rather than < j ) ldea \ following substitution 
of B for B, the price we pay is that the elements fa are instead distributed as iV(0,5f), where the a 2 
differ from a 2 by no more than Ca 2 (^\ max ((I — B)^ 1 ). Treating \ rnax ((I — B)~ l ) as a constant for the 
moment, this term is dominated by Ca 2 ^ i.e., our error in estimating the rows of B. Hence, for example, 
if a 2 = 0(n~ w ) with to > 1, as in Remark 1, then the variances of will also be 0(n~ w ). 

Remark 5: Suppose instead that <p = (0, . . . , 0, fa, 0, . . . , 0)', for some fa > 0. This case corresponds to 
the simplest alternative hypothesis we might use, involving a non-trivial perturbation, and is a reasonable 
proxy for the type of 'genetic perturbations' (e.g., from gene knock-out experiments) considered in 
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Cosgrove et a/.(8]|. Now the bias is potentially non-zero, even for units i with fa = 0. But, again treating 
Amax {(I — B)~ r ) as a constant, and assuming a 2 = 0(n -w ), this bias will be only negligibly worse than 
the 0(ra - ( w-1 )/ 2 ) magnitude of the ideal standard deviation a. And the variance will again be 0(n~ w ). 
Therefore, we should be able to detect single-unit perturbations well for fa sufficiently above the noise. 
Our simulation results in the next section confirm this expectation. 

Now consider the term \ max ((7 — B)" 1 ), which reflects the effect of the topology of G on our ability 
to do detection with network filtering. This term will not necessarily be a constant in n, due to the role of 
n in the bounds ([8]) and © of Theorem 1, constraining the behavior of £ = (7 — 5) _1 a 2 . The following 
lemma lends some insight into the behavior of this term in the case where the precision matrix f2 = S" 1 
again has the simple form specified in (fTTI) . The proof may be found in Appendix C. 

Lemma 1: Suppose that X -1 = 7 + qD~ x l 2 AD" 1 / 2 , as in 4771) . Then 



Remark 6: Because we assume that the network G will be sparse, and that d max < n, the above result 
indicates that the term X max ((7 — 7?)" 1 ) can be treated under our simplified covariance as a constant 
essentially with respect to a 2 (+ in expressions like (fTTT) and (fT8l) . 



In this section, we use simulated network data to further study the performance of our proposed network 
filtering method. The data are drawn from the models for training and test data defined in Section II, 
with randomly generated covariance matrices E. We define these co variances through their corresponding 
precision matrices f2 = which are obtained in turn by (i) generating a random network topology 
G = (V, E), and then (ii) assigning random weights to entries in f2 corresponding to pairs i,j with edges 
{i, j} G E. These collections of weights are then rescaled in a final step to coerce Vl into the form 7 — B 
and, if necessary, to enforce positive definiteness. For the topology G, we use the three classes of random 




(19) 



V. Simulation Results 



A. Background 
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network topologies G = (V, E) described above in Section III.A i.e., the ER, BA, and geometric networks. 
For each choice of network, we use p = 100 nodes, each of which has an average degree of d = 4. The 
adjacency matrices A of the ER and BA model are generated randomly using the algorithms listed in 0, 
while that of the geometric network is generated according to the method described in Il2i1 . 

In implementing our network filtering method, the Lars |fi~2| implementation of the Lasso optimization 
in © was used, on training data sets of various sample sizes for each network. The Lasso regularization 
parameter \x was chosen by cross-validation. To generate testing data, we used single-unit perturbations 
of the form = (0, . . . , 0, 0*, 0, . . . , 0)', where 0* > is in the i-th position, for each % = 1, . . . , 100. 
Since a 2 in our simulation is effectively set to 1, 0* can be interpreted as the signal-to-noise ratio (SNR) 
of the underlying perturbation. In our simulations, we let 0* range over integers from 1 to 20. Our final 
objective of detection is to find the position of the unit at which the external perturbation occurred. In our 
proposed network filtering method, we declare the perturbed unit to be that corresponding to the entry of 
with largest magnitude i.e., i = argmaxi<j< p |0j|. 

In each experiment described below, our method is compared with two other methods. The first, called 
'True', is that in which the ideal < p' ldeal is used instead of 0, which presumes knowledge of the true B. 
The second, called 'Direct', is that in which the actual testing data Y i.e., the data without network 
filtering, are used instead of 0. In both cases, we declare the perturbed unit to be that corresponding to 
the entry of largest magnitude. The 'True' method gives us a benchmark for the detection error under 
the ideal situation that we already have all of the network information, while the 'Direct' method is a 
natural approach in the face of having no information on the network. By comparing our method with the 
two, we may gauge how much is gained by using the network filtering method. In all cases, performance 
error is quantified as the fraction of times a perturbed unit is not correctly identified i.e., the proportion 
of mis-detections. Results reported below for all three methods are based in each case upon 30 replicates 
of the testing data. Our plots show average proportions of mis-detections and one standard deviation. 
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B. Results 

First we present the results from an experiment where X is defined according to the simple formulation 
given in (fTTI) . the definition that underlays the results in Theorem 2 and Lemma 1. That is, we define 
S in terms of just the (random) adjacency structure of our three underlying networks, scaled by an 
appropriate choice of q to ensure positive definiteness. We may think of this case, from the perspective of 
the simulation design described above, as one with a particular non-random choice of weights for edges 
in the network G i.e., where B = —qD~ x l 2 AD~ x l 2 . 

Figure |5] shows the average proportions of mis -detections, as a function the SNR, for these three models. 
Note that since the underlying graphs are random, there is some variability in such detection results 
from simulation to simulation. However, these plots and the others below like them are representative 
in our experience. From the plots in the figure, we can see that in all cases the network filtering offers 
a significant improvement over the 'Direct' method, and in fact comes reasonably close to matching 
the performance of the 'True' method, with mis-detections at a rate of roughly 5-25% for high SNR. 
Performance differs somewhat with respect to networks of different topology. The network filtering method 
shows the most gain over the 'Direct' method with the BA network. This phenomenon is consistent with 
our intuition: the distribution of edges in the BA network is the least uniform one, and certain choices 
of perturbed unit (i.e., perturbed units % with large degree di) will enable the effects of perturbation to 
spread comparatively widely. Hence obtaining and correcting for the internal interactions among units in 
the network is particularly helpful in this case. 

Now consider the assignment of random weights to edges in G, which allows us to generate a richer 
variety of models. For this purpose, we choose the family of beta distributions Beta(a, b) from which to 
draw weights B^ independently for each edge {i, j} E E. Three different classes of distributions were 
used i.e., Beta(l, 1), Beta(l/2, 1/2), and Beta(2,2), which gives flat (uniform), U-shape, and peaked 
shape forms. Shown in Figure [6] are the results of our network filtering method, the 'True' method, and 
the 'Direct' method, for each of these three choices of weight distributions, for each of the three network 
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Fig. 5. Plots of the proportion of mis-detections versus signal-to-noise ratio, for the BA, ER and geometric random networks, based on the simplified 
covariance model in jilt , using q = 1.25, p = 100 and n = 50. Error bars indicate one standard error over 30 test datasets. 

topologies. The same (random) network topology is used in each plot for each type of network. 

Broadly speaking, these plots show that the performance of network filtering in the context of randomly 
generated edge weights B^, as compared to that of the 'True' and 'Direct' methods, is essentially 
consistent with the case of fixed edge- weights underlying the plots in Figure \5\ However, there are some 
interesting nuances. For example, in the case of 'Flat' weights, network filtering in fact is able to match 
the performance of 'True' for all three classes of graphs. On the other hand, in the ER random network 
topology this matching occurs only when the edge- weight distribution is flat (i.e., Beta(l, 1)), and in the 
BA random network topology, when the distribution is either /7-shaped (i.e., Beta(l/2, 1/2)) or flat (i.e., 
Beta(l, 1)). Nevertheless, the qualitatively similar performance across choice of edge-weight distribution 
suggests that most important element here is the network structure, indicating connection between pairs 
of units, with the strength of connection being secondary. 

Finally, we consider the effect of sample size n and, therefore implicitly, the extent to which the condition 
in ([8]) on the structure of the covariance matrix £ may be relaxed. For the same networks used in the 
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Fig. 6. Plots of proportions of mis-detections versus signal-to-noise ratio. Columns: BA (left), ER (middle), and geometric (right) random networks. Rows: 
U-shaped (top), flat (middle), and peaked (bottom) choice of weight distributions, generated according to Beta(l/2, 1/2), Beta(l, 1) and Beta(2,2) 
distributions, respectively. 

simulations described above, with p = 100 units, we varied the sample size n to range over 20, 50, 100, 
and 150. Weights of the network edges are set according to a Beta(2, 2) distribution, which is the 'peak' 
case. Training and testing data were generated as before. The results of using network filtering in these 
different settings are shown in Figure Ul Again, our network filtering method is seen to work similar to 
above. Even for a sample size as small as n = 20, our method still does better than the 'Direct' method 
in all three models, particularly under the BA and ER models^. 
On a final note, we point out that in all of the experiments the richness of network models studied is 

4 We note that some care must be used in fitting Lasso with n = p, due to numerical instabilities that can arise. This issue affects any method attempting 
to estimate the inverse of a covariance matrix (as is implicitly being done here). Kramer 1171 describes how a re-parameterization of the Lasso penalty can 
be used to avoid this problem. 
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Fig. 7. Plots of proportion of mis-detections versus signal-to-noise ratio for the BA (left), ER (middle), and geometric (right) random network models, for 
sample sizes n = 20, 50, 100, and 150. 

much broader than suggested by our theory. As was mentioned earlier, the concentration inequalities we 
use can be expected to be conservative in nature, and therefore some of the bounds are more restrictive 
than practice seems to indicate is necessary. For example, in our simulations involving the geometric 
random graph in Figure [2l with a sample size of n = 50 and S = 9, the theoretical bound ([8]) for the 
eigenvalue ratio is 6.12, while the actual value achieved by this ratio is 219 in this instance. Also, the 
maximum degrees of the graphs in most of the simulations are larger than the average degree 4, and hence 
the sparseness rate S/p > 4/100, which is already larger than those theoretical sparse rates suggested by 
Figure 1. Yet still we observed the network filtering method to perform quite well. It is an interesting 
open question to see if the theory can be extended to produce bounds like © that more accurately reflect 
practice, to serve as a better practical guides for users. 



VI. Discussion 

The concept of network filtering considered in this paper was first proposed by Cosgrove et al. flU, 
as a methodology for filtering out the effects of 'typical' gene regulatory patterns in DNA microarray 
expression data, so as to enhance the potential signal from genetic targets of putative drug compounds. 
Here we have formalized the methodology of Cosgrove et al. and established basic conditions under which 
it may be expected to perform well. Furthermore, we have explored the implications of these conditions 
on the topology of the network underlying the data (i.e., a Gaussian concentration graph). Proof of our 
results rely on principles and techniques central to the literature on compressed sensing and, therefore, 
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like other results in that literature, make performance statements that hold with over- whelming probability. 
Numerical simulation results strongly suggest a high degree of robustness of the methodology to departure 
from certain of the basic conditions stated in our theorems regarding network topology. Our current work 
is now focused on the development of adaptive learning strategies that intentionally utilize perturbations 
(i.e., in the form of the vectors <p) to more efficiently explore network effects (i.e., the matrix B). 

Appendix A 
Proof of the First Zonklar Equation 

Appendix B 
Proof of Theorem 1 

Theorem 1 jointly characterizes the accuracy of p simultaneous regressions, each based on the model 
in (OQ) i.e., for % — 1, . . . ,p, 

p 

For convenience, we re-express the above model for a single regression in the generic form 

R = Xf3 + e . (21) 

Here X is a n x (p — 1) design matrix with rows sampled i.i.d. from a multivariate normal distribution 
iV(0, C), with (p — 1) x (p — 1) covariance matrix C; e is an n x 1 error vector, independent of X, with 
i.i.d. iV(0, a 2 ) elements; and R is the fixl response vector. 

We will make use of a result of Zhu 11261 . which requires the notion of restricted isometry constants. 
Following Zhuj, we define the S'-restricted isometry constant 8s of the matrix X as the smallest quantity 
such that 

(l-<fe)||c|| 2 < ||X r c|| 2 < (l + 5 5 )||c|| 2 (22) 
for all index subsets T C {1, . . . ,p} with \T\ < S. Zhu's result is then as follows. 

5 This definition differs slightly from that in Candes (4)- See 1261 for discussion. 
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Lemma 2 (Zhu): If(i) the number of non-zero entries of ft is no more than S, (ii) the isometry constants 
64s and 5$s obey the inequality 64s + 25^s < h and ( Hi) the Lasso regularization parameter ji obeys the 
constraint p? < Co(/S, for ( = ||e|||, then 

\\P-P\\l<CC, (23) 

where 

P = argmm\\Xp-R\\l + //||/3|| a . (24) 

Zhu's first condition is assumed in our statement of Theorem 1. Therefore, to prove Theorem 1 we need 
to show, under the other conditions stated in our theorem, that Zhu's second and third conditions above 
hold simultaneously for each of our p regressions, with overwhelming probability. In addition, we need 
to show that the right-hand side of (1231) is bounded above by the right-hand side of (TTOl) . 

A. Verification of Lemma [2] Condition ( ii) 

The essence of what is needed for the restricted isometry constants is contained in the following lemma. 

Lemma 3: Suppose Ct is a sub-matrix of the covariance matrix C with columns variables corresponding 
to these in C of the indices in set T, where \T\ = S. Denote the largest and smallest eigenvalues of any 
such matrix Ct as A max (Or) and A min (Or) respectively. Suppose too that 

u« - [T^m ) Pi ) • ( } 

where p(r) is defined as in H4\) . Then the condition 64s (X) + 2^55 (X) < 1 holds with overwhelming 
probability. 

The covariance matrix C corresponding to any single regression is a sub-matrix of £ in Theorem 1, 
and hence so is any sub-sub-matrix Ct- By the interlacing property of eigenvalues (e.g., Golub and van 
Loan ||T4l Thm 8.1.7]), which relates the eigenvalues of a symmetric matrix to those of its principle 
sub-matrices, as long as S satisfies the eigenvalue constraint ([8]), the matrices C T will as well. So it is 
sufficient to prove Lemma [3] 
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Proof of Lemma Let X T denote the n x S sub-matrix of X corresponding to the subset of indices 
T. Since the rows of X are independent samples from iV(0, C), the rows of Xt are independent samples 

1 /2 

from iV(0, Ct). Let <Xj be the i-th largest singular value of C T and &i be the i-th largest singular value of 
rr x l 2 X T . The eigenvalue condition in the lemma reduces to [pxl^s] < [l + (S/n) 1 ' 2 ] / [l — (S/n) l l 21 \. 
Without loss of generality, therefore, assume that o\ < 1 + (S/n) 1 / 2 while a s > 1 — (S/n) 1 / 2 . 

Note we can express X T as X T = ZC^ 2 , where Z ~ N(0,I). Then X^X T = [C]( 2 ]'Z'ZC]! 2 and 
hence the eigenvalues of X' T X T are the same as those of Z'ZCt- Thus we have 

X' T X T \ ( Z'Z\ 2 



Amax ( — - — J < A max y—^- ) ■ a i (26) 

Amin I I — A m j n 

( I • °> ( 27 ) 



n / \ n 

Let a* denote the i-th largest singular value of n~ x l 2 Z. Therefore we havd_ 

o-i < o\-o x (28) 
«r s > a* s -a s (29) 

Denote by cx m i n (-), a max (-) the smallest and largest singular values of their argument. Notice that for any 
index set T* C T, we have 

- ( X A^~ ( Xt *\^z (X t *\ (X t \ 
VvW \Vn J \VnJ \VnJ 

Thus we need only to consider the situation where \T\ = S and choose S as the smallest constant that satisfies i22\ for any sub-matrix 
Xt of size n x S. Therefore, we set 1 — S < as < &i < 1 + S, where a\ = sup| T | =s a\ and as = inf |t|=s <5"s- It then follows that 
8 < max(l — as, ai — 1). 

Now, by the large deviation results in 1201 . (4), for a standard Gaussian random matrix Z ~ iV(0, /), there are two relevant concentration 
inequalities: 

P (ffj > 1 + sfsj^ + v + tj < e~ nt2/2 (30) 
P (a s < 1 - v/SA* - V - t) < e~ nt2/2 , (31) 



where 77 is an o(l) term. 

6 (^s) 2 e 1 ua l s me smallest eigenvalue of Z'Z/n, which is A m ;„ in (4). Similar for (<r*) 2 . 
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We can then use the above tools and concentration inequalities to see how S behaves under the conditions described in Lemma [3] Notice 
that, for e > 0, we have 

P{l + S> (i + (l + e)/(r)) 2 ) (32) 

< P(max(2-a s ,ai) > (1 + (1 + e)/(r)) 2 ) (33) 
= P({2-a s > (l + (i + e)/(r)) 2 }U (34) 

{ax > (l + (l + e )/(r)) 2 }) (35) 

< P(a s <2-(l + (l+e)/(r)) 2 )) (36) 
+P(<ri > (l + (l + e)/(r)) 2 ) (37) 

Denoting 7 = S/n, we have by l |29l > that as > (1 — y/^D&s- Therefore, for the term with as in {37}, we have by Bonferroni's inequality 
that 

P(? s <2-(l + (l + £ )/(r)) 2 )) 

< Yl P(MXT)<2-(l + (l + e)f(r)) 2 )) 

{|T|=S} 

< |{|T| = S}\P ((1 - V7) • *S < 2 - (1 + (1 + £)/(r)) 2 )) 

< C(p,5). 

P ((1 - Vt) • *s < 1 - 2(1 + e)f(r) - ((1 + e)f{r)f) 

< C{p,S)P{a% <1-(1 + e)/(r)) . 

As in |4], we fix rj + t = (l + e)- s /pJn- s /2H(r), from which it follows that (l + e)/(r) ~ sf%+V + t and C(p, s)e"™' 2/2 < e-P^f 1 -)^ 2 . 
Hence the above inequality is equivalent to 

P{a s <2-(l + (l + s)f(r)) 2 )) 

< C(p,g)pfo<(l-y/Sfii-T)-t)) 

< C(p,S)e- nt2/2 

< e -pH(r)e/2 

For the term with a\ in J37b . the analogous inequality 

P(cri > (1 + (1 + e)f{r)f) < e - pH{r)e/2 

may be verified using a similar argument. Combining these two probability inequalities for a\ and as, we have that 
P (l + 5 > (1 + (1 + e)/(r)) 2 ) < 2 • e~ pH ' r ' e ' /2 . Ignoring the negligible e terms, it follows that when p,n is large enough S < 
(1 + f(r)) 2 - 1 holds with overwhelming probability. Defining p(r) = (1 + /(4r)) 2 + 2(1 + /(5r)) 2 - 3, we have that S4.S + 2S 5 s < p(r). 
Imposing the condition p(r) < 1, we obtain that 64s + 2<$ss < 1, and therefore Lemma [3] is proved. 
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B. Verification of Lemma [2] Condition ( Hi) and the right-hand side of ( |70| ) 

Let = X^B^ + e^*' denote the regression equation J2U for the i-th of the p simultaneous regressions in i20\ . Condition (iii) of 
Lemma [2] requires that the regularization parameter be such that /i 2 < CoC /S, where £^ = He' 1 '!! 2 .. If so, and assuming of course that 
conditions (i) and (ii) are satisfied as well, then the inequality in J23b says that ||/3' 1 ' — B ||1 < C^'Q 1 "' . We show that the condition on 
in the statement of Theorem 1 i.e., that fi 2 < (Coa 2 ^)/S, guarantees that Condition (iii) holds for every i = 1, . . . ,p with overwhelming 
probability. In addition, we show that C (i) C W < Ca 2 (+ with overwhelming probability, where = n ^1 + 4(log 2 n/n) 1//2 J , and C is 
bounded by (1 — p(r))~ 2 times a constant, as claimed in Remark 1. 

Notice that if we have e ~ iV(0, a 2 I n xn), then 1 1 e 1 1 § is distributed as chi-square on n degrees of freedom. By 1181 Sec 4.1, Lemma 4], 
for t' > 0, 

P | ||e||l - no 2 < -4<T 2 VnF} < exp(-t') 

and 

P | ||e||2 - no 2 > 4a 2 VnF} < exp(-t') . 

Therefore, 

p (mm H e ' J 2 < (1 _ 4yFM)l 
[ i rwr 2 J 

<£*{^<(i-vto} 

!=1 ^ 

= p p/JN|< (l_4V*V»»)} <pexp(-i') , 

and similarly, 

f || e M|| 2 , 1 

P^max^ P- > (1 + 4JF/n) } < pexp(-t') . 

[ i no 2, J 

We choose t' so that /i 2 < Co 1 1 e^-' 1 1 i uniformly in i with probability at least 1 — 2e~ pH ^ t ^ 2 , so as to match the rate in Section A 
above. Specifically, we set t' — [pH(r)e] /2 + log (p/2). Hence, for sufficiently small e we have t' ~ log(p/2). Therefore, as long as 
< {C o 2 n/S) \l - 4(t'/n) 1/2 j, then with probability exceeding 1 - 2e" pH(r)e/2 , the inequality fi 2 < C \ |e w | \l/S holds uniformly 
in i. Suppose p = n v ', with z/ > 1. Under this condition, i'/n ~ (log 2 n' y ) /n = v (log 2 n) /n and our requirement thus reduces to 
H 2 < (C a 2 n/S) — 4 ((log 2 n)/n) 1/2 J . Similarly, with t' ss log (p/2), we also have with probability exceeding 1 - 2e~ pH(r)e/2 that 
||e (l) ||l <o 2 n [l + 4((log 2 n)/n) 1/2 ]. 

Let ("n an d Cn be defined as in the statement of Theorem 1. Then by requiring that fi 2 < (Coa 2 £n)/S, from the above results it follows 
that Condition (iii) of Lemma 2 holds for all i = 1, . . . ,p, with high probability. Furthermore, with high probability, He^'H 2 < cr 2 £+, for 
i — 1, . . . ,p. Therefore, by the bound l |23t in Lemma 2, we have established the bound J 1 01 > in Theorem 1, except for the constant C. 

Specifically, it remains for us to establish that C (l ' < C for all i. Denote the value C^ 1 ' for an arbitrary regression by C . Note that the 
C here in our paper corresponds to the square of what is called 'C in Zhu 1261 , Hence by equation (17) in 1261 , C 1 ' 2 is smaller than the 
larger root of a quadratic equation of the form 

a 2 z 2 - (2ac + b)z + (c 2 - r) = , 
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where z is the argument and a, b, c and t are positive parameters |_| For our purposes, it is enough to remark that a > (1 — 84,3 — 2(?5s)/3, 
b is bounded by a constant proportional to C\^ 2 , c relates to a through the expression a — c ^ZpS 1 / 2 /( 1/,2 J — 1 — 84s, and r is a constant 
greater than four, 

As the larger root of the above quadratic equation, 



- 1/2 2ac + b + v /(2ac + &) 2 -4a 2 (c 2 -r) 

(_y ^ — r . 

2a 2 

Note that (2ac + b) 2 — 4a 2 (c 2 — r) = 4a6c + 6 2 + 4a 2 r, which is bounded by max((2ar 1//2 + b) 2 , (2ac + 6) 2 ), because a, b, c and r are 
all positive. Hence we have 

xAii „ / 2ac + 6 + 2<zt 1/2 + b 2ac + b + 2ac+b 
C < max , 

2a 2 ' 2a 2 



[t 1/2 +c b 2c 6 

< max 1 -, 1 - 

\ a a z a 

< 2 max(c ' rl/2) +A . (38 ) 

a a z 

It remains for us to bound the right-hand side of ( 138b . Recall that, by construction, <54s + 2<55s < p(r), and that p(r) is assumed strictly 
less than 1. Thus, because b is bounded by a term proportional to Cq^ 2 , the second term in the right-hand side of d38fc is bounded by a term 
proportional to (1 — p(r)) 2 . Furthermore, if r 1//2 > c, then the first term is bounded by a term proportional to (1 — p{r)). Lastly, therefore, 
suppose that c > r 1//2 and consider the term 

c 1 a + l + S 4S 

a 2/iS 1 /2/£i/2 a ■ ^ y > 

The term involving a in the right-hand side of l |39l > is easily bounded, per our reasoning above, while - taking, for example, p 2 = CoC/S 
in the condition of Lemma[2] the other term is equal to (4Co) -1 . 

Hence, returning to the context of our original problem, for each i = 1, . . . ,p, the constant C^' is bounded by some constant times 
(1 — p(r))~ 2 . Letting C be the largest of these bounds, our proof of Theorem 1 is complete. 

Appendix C 
Proof for Theorem 2 

To show that condition <[8j of Theorem 1 holds in the context of Theorem 2, we first bound the eigenvalue ratio of the covariance matrix 
E. For q £ (0, 1), the matrix S _1 is diagonally dominant, and hence, by the Levy-Desplanques Theorem, non-singular. Furthermore, since 
E _1 is real and symmetric, it is a normal matrix. Therefore, 

A m ax(S) l/A m i n (S ) A ma x(S 1 ) 1 ) 

A m in(S) l/A ma x(S — 1 ) A m i n (Ij — 1 ) 

where ac2(S _1 ) is condition number of the precision matrix E _1 . As a result, by an inequality of Guggenheimer, Edelman, and Johnson 1161 
Pg. 4] for condition numbers, we can bound our eigenvalue ratio as 

US) < 2 ( \\z^ L Y /2 

A min (E) " |det(E- 1 )| V P J 
7 Our notation here is slightly different from that of 1261 . 
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Since E^ 1 = I + qD~ 1/2 AD~ 1/2 , direct calculation shows that 

HE- 1 ^ = p + q 2 \\D-^AD-^\\l=P + <l 2 J2( -7= ? 

— \/didj 
s ~j v J 

where dt is the i-th element of the diagonal matrix D. As for the quantity | det(E _1 )|, we note that 

|det(E _1 )| = |qdet(D~ 1/2 )det(l/q,D + A) det(C 1/2 )| 

= q"f[d- X \tet{l/qD+A)\ . 

i=l 

Denoting M — (l/q)D + A, and applying a result of Ostrowski for determinants of diagonally dominant matrices (e.g., 1221 ). we find that 

v p , 

i det(AQi > n(i M «i + E l M «U = + di ) 

'l + g s 



n* 



Hence, we have det(E _1 ) > (1 + q) p - 

Combining the relevant expressions above, we have that 



„ 2 v-p \ih \ p/2 



^<J- 1+ «^ - I (41) 
A min (E) " + p I ^ 

p/2 



= 2 



(42) 



.(l + q) 2 v l + g 

Denoting rji and 772 as in dl2t . bounding the right-hand side of l !42t by j^l + (d m ai/"-) 1//2 J / ^1 — {d ma x /n) 1 ^ 2 ^ , to enforce the bound {§}, 
and some trivial manipulation of the resulting inequality, yields the condition in J 1 3 b , as desired. 

Appendix D 
Proof for Theorem 3 

Note that the difference of the predictor (j> and the true external effect (f> is given as <j> — <j> — Y — BY — 4> = (I — B)Y — <j>. So the bias 
term is just 

E[(4> - <j>)\B] = (I - B)(I - B)- 1 ^ - 4> 

= [(I -B)-(I- B)](I - B)~ X (j) = A(J - B)~ V , 

where A = B — B, and hence for the i-th component of the bias term, we have 



E 



(0i — 00 | | = \Ai.{I - B)- 1 ^ 

< \\A im \\ 2 IKi-s)- 1 ! 



< 



[(Ca 2 C+) 1/2 xUKil-B)- 1 )] U\\ 2 . 
Here the first term in brackets follows from Theorem 1, while the second follows from the definition of the £2 matrix norm. 
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For the variance of the predictor <f>, we have 

Var | fi] = (J — B)(I — B)~ 1 (I — B) T a 2 



(I-B + B- B)(I - B) (/ — B + B — B) T a 
(I - B)a 2 + \a(I - B) _1 A T + 2Al a 2 . 



So the variance of the zth element fa of <f> is 

Var [fa j fi] = a 2 + A».(/ - B^A^a 2 , 
since = Bu = 0. The absolute value of the second term is bounded by (Ccr 2 ^) X m ax ((I — £>) _1 ), and thus J 1 8t follows. 

Appendix E 
Proof for Lemma 1. 

Under the model E" 1 = I + qD^ 1 ^ 2 AD^ 1 ^ 2 , the largest number of non-zero entries S is d ma x, the maximal degree of the network. So 
by the eigenvalue condition in Theorem 1, we have 



A max ((/ - B)- 1 ) < A min ((J - B)- 1 ) ■ ( 1 + ^ B ) 



Now A min ((/ - B)- 1 ) = A m L((J - B)) = A m L(/ + qD' 1 ' 2 AD' 1 / 2 ). And clearly A max (J + qD- 1 / 2 AD' 1 ' 2 ) = 1 + 
X ma , x (qD- 1/2 AD- 1/2 ). Furthermore, 

A max (-D ' ' ) = max 



= max 



(D-^ 2 xYA(D^ 2 x) {D^^xyjD- 1 ' 2 ! 
(D- 1 / 2 x)'(D- 1 / 2 x) ' x'x 



> A max (A)A mln (D- 1 ) = A 7 x(4) 



By Lem. 8.6], A max (A) > dUL Therefore, A max (J + qD~ 1/2 AD~ 1/2 ) > 1 + qd 



max 



Combining these results, we have 



A^I-B)- 1 ) < j—^—i |'l±\/pa*p 

9 Vd mnx \ 1 ~~ V "max/n 



max \ 1 \jdxi\&x. j 
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